AIML421, Assignment 4 (Part A), Corvin Idler, Student ID 300598312, idlercorv@myvuw.ac.nz

Installing pandas profiling tools in Colab

Importing everything I need

Load the data, take a peek, set the exported column (Unnamed: 0) as the index of the data frame, and then drop the redundant column
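The loading step above can be sketched like this; the filename and the tiny inline CSV are stand-ins for the real diamonds file, which carries its old row index as an "Unnamed: 0" column:

```python
import io
import pandas as pd

# Minimal stand-in for the real CSV (filename/columns are assumed here);
# the exported file stores the old row index as "Unnamed: 0".
csv = io.StringIO(
    "Unnamed: 0,carat,cut,price\n"
    "0,0.23,Ideal,326\n"
    "1,0.21,Premium,327\n"
)
# index_col=0 uses "Unnamed: 0" as the index, so no separate drop is needed.
df = pd.read_csv(csv, index_col=0)
df.index.name = None  # the leftover index name carries no information
```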

Initial Data Analysis + Exploratory Data Analysis

We don't seem to have any sizable amount of missing data. There are some entries with zero X, Y or Z dimensions; we could exclude them from the data set or estimate them, e.g. in a MICE fashion. I ended up opting for imputation. Some numerical variables are highly skewed (including the target variable). Some interesting domain reading: https://www.diamonds.pro/guides/diamond-proportion/
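A quick sketch of the check described above, on a toy frame with the same x/y/z dimension columns: count true NaNs, flag physically impossible zero dimensions, and recode those zeros as missing so the imputer can handle them later.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the diamonds x/y/z (mm) and price columns.
df = pd.DataFrame({
    "x": [3.95, 0.0, 4.05],
    "y": [3.98, 3.84, 0.0],
    "z": [2.43, 2.31, 2.50],
    "price": [326, 327, 334],
})

print(df.isna().sum())  # truly missing values per column

# A diamond cannot have a zero dimension, so these are disguised missing values.
zero_dims = (df[["x", "y", "z"]] == 0).any(axis=1)
print("rows with a zero dimension:", zero_dims.sum())

# Recode zeros as NaN so they can be imputed later.
df[["x", "y", "z"]] = df[["x", "y", "z"]].replace(0.0, np.nan)
```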

Pre-processing:

Split into training and test sets. Categorical variables need to be turned into numerical ones.
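A minimal sketch of those two steps, on toy data (the real frame also has the color and clarity categoricals). One-hot encoding is used here for illustration; since cut/color/clarity are ordinal, an ordinal encoding would be a reasonable alternative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the diamonds frame.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Good", "Ideal"],
    "price": [326, 327, 334, 335],
})

# One-hot encode the categorical column, then split off the target.
X = pd.get_dummies(df.drop(columns="price"), columns=["cut"])
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```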

Filling in missing values in a MICE fashion. Normalizing the numeric variables, as many are highly non-normal and some heavily skewed, using Box-Cox and quantile transformations. Code mainly taken from https://scikit-learn.org/stable/auto_examples/preprocessing/plot_map_data_to_normal.html
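The pipeline above can be sketched with scikit-learn's `IterativeImputer` (its MICE-style imputer) plus `PowerTransformer` and `QuantileTransformer`; the synthetic lognormal data below is just an assumed stand-in for the skewed diamond features.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 3))           # skewed, strictly positive data
X[rng.integers(0, 200, 10), 0] = np.nan    # knock out a few values

# MICE-style imputation: each feature is regressed on the others, iteratively.
# min_value keeps the imputed values positive, which Box-Cox requires.
X_imp = IterativeImputer(min_value=1e-3, random_state=0).fit_transform(X)

# Box-Cox (positive inputs only) vs. quantile mapping to a normal shape.
X_bc = PowerTransformer(method="box-cox").fit_transform(X_imp)
X_qt = QuantileTransformer(
    output_distribution="normal", n_quantiles=100, random_state=0
).fit_transform(X_imp)
```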

It goes without saying that the quantile transformation leads to better results in terms of "normalizing" the distributions. I will test later on whether that also translates into better prediction results.

Outlier Analysis

Inspired by https://www.kaggle.com/heeraldedhia/regression-on-diamonds-dataset-95-score/notebook I did an outlier analysis, and there certainly are some outliers. Many of them are due to the missing entries (zero on x, y, z), and many of the others are univariate outliers (extreme in a single dimension). Given the overall number of data points, I don't think I need to spend too much time eliminating outliers from the training set, and I would be tempted to do this (if at all) only for multivariate outliers... which complicates the identification somewhat (e.g. statistical multivariate quality-control methods, or machine-learning algorithms like isolation forest). Long story short, I decided to leave things as they are and see what happens.
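If one did want to flag multivariate outliers, scikit-learn's `IsolationForest` is one option; this sketch plants a few extreme points in synthetic data (not the diamonds data) just to show the mechanics:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Plant three obvious multivariate outliers at the start.
X[:3] = [[8.0, 8.0, 8.0], [-9.0, 7.0, -8.0], [9.0, -9.0, 9.0]]

# Isolation forest flags points that random axis-aligned splits
# can isolate quickly; contamination sets the expected outlier share.
labels = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)
outliers = np.where(labels == -1)[0]  # -1 marks flagged points
```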

Inspired by the same source as the charts above, I thought it would be fun to plot volume against weight to identify really oddly shaped diamonds. I would argue those might be more "meaningful" outliers. But again, given the overall number of data points compared to the number of outliers, I decided to leave them be and see what happens.
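The volume-versus-weight idea reduces to a derived column like the one below; the toy rows are assumptions, and the x·y·z product is only a rough bounding-box proxy for volume, which is enough to spot stones whose weight doesn't match their size.

```python
import pandas as pd

# Toy rows; the real data has x/y/z in mm and carat (1 carat = 0.2 g).
df = pd.DataFrame({
    "x": [3.95, 4.05, 4.20],
    "y": [3.98, 4.07, 4.23],
    "z": [2.43, 2.31, 2.63],
    "carat": [0.23, 0.21, 0.29],
})

df["volume"] = df["x"] * df["y"] * df["z"]      # rough box volume in mm^3
# Weight per unit volume; oddly shaped stones fall far from the typical line.
df["carat_per_volume"] = df["carat"] / df["volume"]
```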

Modelling

The first thing I wanted to find out was whether there was a meaningful difference between the raw data, Box-Cox normalisation, and the quantile transformation. As a test case I used OLS. It turns out the quantile transformation may have destroyed some meaningful relationships in the data, or "overfit" the training data, so the Box-Cox transformation leads to better results in this particular scenario. I didn't have enough computing time, but it would be interesting to do a meta-study across the various algorithms to see if the above finding holds for all of them (at least for this particular data set).
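A comparison of that kind can be set up with three pipelines scored on a held-out set; the synthetic skewed data and the target relationship below are assumptions for illustration, not the diamonds data or its actual scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(400, 3))                          # skewed features
y = 2.0 * np.log(X[:, 0]) + rng.normal(scale=0.1, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "raw": LinearRegression(),
    "box-cox": make_pipeline(
        PowerTransformer(method="box-cox"), LinearRegression()
    ),
    "quantile": make_pipeline(
        QuantileTransformer(output_distribution="normal", n_quantiles=100,
                            random_state=0),
        LinearRegression(),
    ),
}
# R^2 on the held-out test set for each preprocessing variant.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
```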

Looks like Box-Cox is the winner. I will use that from here on.